Method for upgrading a data stream of multimedia data

State of the art

The invention describes a method for upgrading a data stream 
of multimedia data, which comprises features with textual 
description. 

In order to describe exactly, for example, the pronunciation of a
text, e.g. for controlling a speech synthesiser, the "World
Wide Web Consortium" (W3C) is currently specifying a so-called
"Speech Synthesis Markup Language" (SSML,
http://www.w3.org/TR/speech-synthesis). Within this
specification, XML (Extensible Markup Language) elements are
defined for describing exactly how the elements of a text are to be
pronounced.

For the phonetic transcription of text the "International
Phonetic Alphabet" (IPA) is used. The use of the SSML phoneme
element together with high level multimedia description
schemes enables the content creator to exactly specify the
phonetic transcription of the description text. However, if
the same words occur several times in
different parts of a description text, the phonetic
description has to be inserted (and thus stored or
transmitted) for each of the occurrences.
Object and advantages of the invention

With the steps of claim 1 and the corresponding subclaims a
more efficient phonetic representation of specific parts or 
words of high level, textual multimedia description schemes 
is enabled. 

This objective is achieved by means of the present invention 
in that in addition to the textual description a set of 
phonetic translation hints is included. These phonetic 
translation hints specify the phonetic transcription of 
parts or words of the textual description. The phonetic 
transcription enables applications like speech recognition 
or text to speech systems to cope with special cases where 
automatic transcription is not applicable or to completely 
cut out the process of automatic transcription. A second 
aspect of the invention is the efficient binary coding of
the phonetic translation hint values in order to allow low
bandwidth transmission or storage of respective description 
data containing phonetic translation hints. 



Known solutions allow the phonetic transcription of specific
parts or words of the description text for high level
multimedia descriptions. However, the phonetic
transcriptions have to be specified for each occurrence of a
word or text part, i.e. if certain words occur more than
once in a description text, the phonetic transcriptions have
to be repeated each time. The present invention has the
advantage that it allows a phonetic transcription
of specific parts or words of any description text to be specified
within high level feature multimedia description schemes. In
contrast to the state of the art, the present invention
allows the specification of phonetic transcriptions of words which
are valid for the whole description text or parts of it,
without requiring that the phonetic transcription be
repeated for each occurrence of the word in the description
text. In order to achieve this goal, a set of phonetic
translation hints is included in the description schemes.

These translation hints uniquely define how to pronounce
specific words of the description text. The phonetic
translation hints are valid for either the whole description
text or parts of it, depending on the level of the
description scheme at which they are included. In this way, the
phonetic transcription of a set of words needs to be specified
(and thus transmitted or stored) only once, and is then
valid for all occurrences of those words in that part of the
text where the phonetic translation hints are valid. This
makes the parsing of the descriptions easier, since the
description text no longer carries all the phonetic
transcriptions in-line; instead, they are treated separately.
Further, it facilitates the authoring of the description
text, since the text can be generated separately from the
transcription hints. Finally, it reduces the amount of data
necessary for storing or transmitting the description text.
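
The scoping rule described above can be pictured with a small lookup sketch. It is only an illustration under assumptions: the two scopes, the example words and their SAMPA-like strings are hypothetical, and letting hints at a lower level take precedence over hints at a higher level is a choice made here, not something the description prescribes.

```python
from collections import ChainMap

# Hypothetical hints included at the top level: valid for the whole description.
description_hints = {"MFG": "Em Ef ge:"}
# Hypothetical hints included at a lower level: valid for one part only.
review_hints = {"bpm": "bi: pi: Em"}

# Each part of the text sees its own hints first, then those of enclosing levels.
review_scope = ChainMap(review_hints, description_hints)
title_scope = ChainMap({}, description_hints)


def lookup(word, scope):
    """Return the transcription valid for `word` in `scope`, or None."""
    return scope.get(word)


print(lookup("bpm", review_scope))  # hint from the review level
print(lookup("MFG", review_scope))  # hint inherited from the top level
print(lookup("bpm", title_scope))   # no hint is valid in this part
```

Each transcription is stored once per scope, however often the word occurs in the text covered by that scope.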

Detailed description of the invention 

Before discussing the details of the invention, some
definitions, especially those used in MPEG-7, are presented.

In the context of the MPEG-7 standard, which is currently
under development, a textual representation of the
description structures for the description of audio-visual
data content in multimedia environments is used. For this
task, the Extensible Markup Language (XML) is used, where
the Descriptors (Ds) and Description Schemes (DSs) are
specified using the so-called Description
Definition Language (DDL). In the context of the remainder
of this document, the following definitions are used:
• Data: Data is audio-visual information that will be
described using MPEG-7, regardless of storage, coding,
display, transmission, medium, or technology.

• Feature: A Feature is a distinctive characteristic of the
data which signifies something to somebody.

• Descriptor (D): A Descriptor is a representation of a
Feature. A Descriptor defines the syntax and the semantics
of the Feature representation.

• Descriptor Value (DV): A Descriptor Value is an
instantiation of a Descriptor for a given data set (or
subset thereof) that describes the actual data.

• Description Scheme (DS): A Description Scheme specifies the
structure and semantics of the relationships between its
components, which may be both Descriptors (Ds) and
Description Schemes (DSs).

• Description: A Description consists of a DS (structure) and
the set of Descriptor Values (instantiations) that describe
the Data.

• Coded Description: A Coded Description is a Description
that has been encoded to fulfil relevant requirements such
as compression efficiency, error resilience, random access,
etc.

• Description Definition Language (DDL): The Description
Definition Language is a language that allows the creation
of new Description Schemes and, possibly, Descriptors. It
also allows the extension and modification of existing
Description Schemes.

The lowest level of the description is a descriptor. It
defines one or more features of the data. Together with the
respective DVs it is used to actually describe a specific
piece of data. The next higher level is a description
scheme, which contains two or more components and
their relationships. Components can be either descriptors or
description schemes. The highest level so far is the
description definition language. It is used for two
purposes: first, the textual representations of static
descriptors and description schemes are written using the
DDL. Second, the DDL can also be used to define a dynamic DS
using static Ds and DSs.



With respect to the MPEG-7 descriptions, two kinds of data
can be distinguished. First, the low level features describe
properties of the data such as the dominant colour, the
shape or the structure of an image or a video sequence.
These features are, in general, extracted automatically from
the data. On the other hand, MPEG-7 can also be used to
describe high level features such as the title of a film,
the author of a song or even a complete media review with
respect to the corresponding data. These features are, in
general, not extracted automatically, but edited manually or
semi-automatically during production or post-production of
the data. Up to now, the high level features are described
in textual form only, possibly referring to a specified
language or thesaurus. A simple example for the textual
description of some high level features is given below.



<CreationInformation>
  <Creation>
    <Title type="original">
      <TitleText xml:lang="en">Music</TitleText>
    </Title>
    <Creator>
      <Role CSName="MPEG_roles_CS" CSTermID="47">
        <Label xml:lang="en">presenter</Label>
      </Role>
      <Individual>
        <Name>Madonna</Name>
      </Individual>
    </Creator>
  </Creation>
  <MediaReview>
    <Reviewer>
      <FirstName>Alan</FirstName>
      <GivenName>Bangs</GivenName>
    </Reviewer>
    <RatingCriterion>
      <CriterionName>Overall</CriterionName>
      <WorstRating>1</WorstRating>
      <BestRating>10</BestRating>
    </RatingCriterion>
    <RatingValue>10</RatingValue>
    <FreeTextReview>
      This is again an excellent piece of music from our well-known
      superstar, without the necessity for more than 180
      bpm in order to make people feel excited. It comes along
      with harmonic yet clearly defined transitions between
      pieces of rap-like vocals, well known e.g. from the
      Kraut-Rappers "Die fantastischen 4" and their former
      chart runner-up "MFG", and on the other hand peaceful
      sounding instrumental sections. Therefore this song
      deserves a clear 10+ rating.
    </FreeTextReview>
  </MediaReview>
</CreationInformation>






The example uses the XML language for the descriptions. The
text in the brackets ("<...>") is referred to as XML tags,
and it specifies the elements of the description scheme. The
text between the tags contains the data values of the
description. The example describes the title, the presenter
and a short media review of an audio track called "Music"
by the well known American singer "Madonna". As can be
seen, all the information is given in textual form, possibly
according to a specified language ("de" for German, or "en"
for English) or to a specified thesaurus. The text
describing the data can in principle be pronounced in
different ways, depending on the language, the context or
the usual customs with respect to the application area.
However, the textual description as specified up to now is
the same, regardless of the pronunciation.



In order to describe exactly, for example, the pronunciation of the
text, e.g. for controlling a speech synthesiser, the "World
Wide Web Consortium" (W3C) is currently specifying a so-called
"Speech Synthesis Markup Language" (SSML,
http://www.w3.org/TR/speech-synthesis). Within this
specification, XML elements are defined for describing exactly how
the elements of a text are to be pronounced. Among
others, a phoneme element is defined which allows the
phonetic transcription of text parts to be specified as shown
below.

<phoneme ph="t&#x252;m&#x251;to&#x28A;"> tomato </phoneme>
<!-- This is an example of IPA using character entities -->

<phoneme ph="tɒmɑtoʊ"> tomato </phoneme>
<!-- This example uses the Unicode IPA characters. -->
<!-- Note: this will not display correctly on most browsers -->

As can be seen, for the phonetic transcription the
"International Phonetic Alphabet" (IPA) is used. The use of
this phoneme element together with high level multimedia
description schemes enables the content creator to exactly
specify the phonetic transcription of the description text.
However, if there are multiple occurrences of the same words
in different parts of a description text, the phonetic
description has to be inserted (and thus stored or
transmitted) for each of the occurrences.
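
The cost of repeating the markup can be estimated with a short sketch. It is only illustrative: the sample sentence, the helper name `inline_phonemes` and the SAMPA-style string used for "tomato" are assumptions made for this example and are not taken from the SSML draft.

```python
def inline_phonemes(text, word, ph):
    """Wrap every occurrence of `word` in an SSML-style <phoneme> element."""
    marked = f'<phoneme ph="{ph}">{word}</phoneme>'
    return " ".join(marked if tok == word else tok for tok in text.split())


# Hypothetical description text in which the same word occurs three times.
text = "tomato soup with tomato bread and tomato salad"
ph = "t@mA:t@U"  # illustrative SAMPA-style transcription

marked_up = inline_phonemes(text, "tomato", ph)
occurrences = text.split().count("tomato")

# Characters added by a single in-line element (everything except the word).
per_occurrence = len(f'<phoneme ph="{ph}"></phoneme>')
overhead = len(marked_up) - len(text)

print(occurrences, per_occurrence, overhead)
```

The overhead grows linearly with the number of occurrences, because the same transcription is written out each time.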

The general idea of the presented invention is to define a 
new DS called PhoneticTranslationHints which gives 
additional information about how a set of words is 
pronounced. The current Textual Datatype, which does not 
include this information, is defined with respect to the 
MPEG-7 Multimedia Description Schemes CD as follows. 

<!-- Definition of Textual Datatype -->
<complexType name="TextualType">
  <simpleContent>
    <extension base="string">
      <attribute ref="xml:lang" use="optional"/>
    </extension>
  </simpleContent>
</complexType>

The Textual Datatype only contains a string for text
information and an optional attribute for the language of
the text. The additional information about how some or all
words in an instance of the Textual Datatype are pronounced
is given by an instance of the newly defined
PhoneticTranslationHintsType. Two solutions for the
definition of this new type are given in the following
subsections.



The first realisation of the PhoneticTranslationHintsType is
given by the following definition.

<complexType name="PhoneticTranslationHintsType">
  <sequence maxOccurs="unbounded">
    <element name="Word">
      <complexType>
        <simpleContent>
          <extension base="string">
            <attribute name="phonetic_translation"
                       type="string"
                       use="required"/>
          </extension>
        </simpleContent>
      </complexType>
    </element>
  </sequence>
</complexType>



The semantics of the newly defined
PhoneticTranslationHintsType are described in the following
table.

Name                        Definition

PhoneticTranslationHints    Contains a set of words and their
                            corresponding pronunciations.

Word                        Single word coded as string.

Phonetic_translation        This element contains the
                            additional phonetic information
                            about the corresponding text. For
                            the representation of the
                            phonetic information, the IPA
                            (International Phonetic Alphabet)
                            or the SAMPA representation is
                            chosen.



This newly created type unambiguously establishes a connection
between words and their appropriate pronunciation. In the
following, an example with an instance of the 
PhoneticTranslationHintsType is given which refers to the 
example discussed before. 



R. 40159



<PhoneticTranslationHints>
  <Word phonetic_translation="b&#152;p&#211;mi&#28A;n&#043;">bpm</Word>
  <Word phonetic_translation="kr&#372;r&#011;pe&#290;">Kraut-Rappers</Word>
  <Word phonetic_translation="em&#001;ef&#005;g&#011;">MFG</Word>
</PhoneticTranslationHints>

With this instance of the PhoneticTranslationHintsType an
application now knows the exact phonetic transcription of
some or all words of the text which is given between the
<FreeTextReview> tags in the example discussed before.
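
How an application might consume such an instance can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes the attribute name `phonetic_translation` from the first realisation, uses hypothetical SAMPA-like strings instead of the transcriptions of the example, and matches words by simple whitespace tokenisation.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the first realisation; the transcription strings
# are illustrative placeholders only.
HINTS_XML = """
<PhoneticTranslationHints>
  <Word phonetic_translation="bi: pi: Em">bpm</Word>
  <Word phonetic_translation="Em Ef ge:">MFG</Word>
</PhoneticTranslationHints>
"""


def load_hints(xml_text):
    """Return a dict mapping each <Word> text to its phonetic_translation."""
    root = ET.fromstring(xml_text)
    return {w.text.strip(): w.attrib["phonetic_translation"]
            for w in root.findall("Word")}


def transcribe(text, hints):
    """Pair every whitespace-separated token with its hint, or None."""
    return [(tok, hints.get(tok.strip('.,"'))) for tok in text.split()]


hints = load_hints(HINTS_XML)
result = transcribe("more than 180 bpm like MFG, or bpm again", hints)
for token, transcription in result:
    print(token, transcription)
```

Both occurrences of "bpm" are resolved from the single stored hint, while tokens without a hint are left to the automatic transcription of the application.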

The second realisation of the PhoneticTranslationHintsType
is given by the following definition.

<complexType name="PhoneticTranslationHintsType">
  <sequence maxOccurs="unbounded">
    <element name="Word" type="string"/>
    <element name="PhoneticTranslation"/>
  </sequence>
</complexType>



The semantics of the newly defined
PhoneticTranslationHintsType, which are the same as in
version 1 described in the previous section, are specified
in the following table.

Name                        Definition

PhoneticTranslationHints    Contains a set of words and their
                            corresponding pronunciations.

Word                        Single word coded as string.

Phonetic_translation        This element contains the
                            additional phonetic information
                            about the corresponding text. For
                            the representation of the
                            phonetic information, the IPA
                            (International Phonetic Alphabet)
                            or the SAMPA representation is
                            chosen.



In the following, an example with an instance of the 
PhoneticTranslationHintsType version 2 is given, which 
refers again to the example discussed before. 



<PhoneticTranslationHints>
  <Word>bpm</Word>
  <phonetic_translation>b&#152;p&#211;mi&#28A;n&#043;</phonetic_translation>
  <Word>Kraut-Rappers</Word>
  <phonetic_translation>kr&#372;r&#011;pe&#290;</phonetic_translation>
  <Word>MFG</Word>
  <phonetic_translation>em&#001;ef&#005;g&#011;</phonetic_translation>
</PhoneticTranslationHints>






With this new definition of the PhoneticTranslationHintsType
an instance of this type consists of the tags <Word> and
<PhoneticTranslation> which always correspond to each other
and build one unit that describes a text and its associated
phonetic transcription.
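
Because the pairing is expressed only by the order of the elements, a reader of such an instance has to consume the children two at a time. The following sketch shows one way to do this. It assumes the element names <Word> and <PhoneticTranslation> used in the paragraph above, and the transcription strings are hypothetical SAMPA-like placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical instance of the second realisation (placeholder transcriptions).
HINTS_V2 = """
<PhoneticTranslationHints>
  <Word>bpm</Word>
  <PhoneticTranslation>bi: pi: Em</PhoneticTranslation>
  <Word>MFG</Word>
  <PhoneticTranslation>Em Ef ge:</PhoneticTranslation>
</PhoneticTranslationHints>
"""


def load_paired_hints(xml_text):
    """Read alternating <Word>/<PhoneticTranslation> children into a dict."""
    children = list(ET.fromstring(xml_text))
    if len(children) % 2:
        raise ValueError("a <Word> without a matching <PhoneticTranslation>")
    hints = {}
    for word, phon in zip(children[0::2], children[1::2]):
        if word.tag != "Word" or phon.tag != "PhoneticTranslation":
            raise ValueError("elements are not in Word/PhoneticTranslation order")
        hints[(word.text or "").strip()] = (phon.text or "").strip()
    return hints


paired_hints = load_paired_hints(HINTS_V2)
print(paired_hints)
```

Compared with the first realisation, the reader has to check the element order itself, since a word and its transcription are no longer tied together by a single element.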

The phonemes used in the phonetic translation hints DSs
described above are in general also represented as
printable characters using a Unicode representation. However, in
general the set of phonemes used will be restricted to a
limited number. Therefore, for more efficient storage and
transmission, a binary fixed length or variable length code
representation can be used for the phonemes, which
may take into account the statistics of the
phonemes.
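
The two coding options can be compared with a small sketch. The description does not name a particular variable length code. Huffman coding is used here purely as one familiar example of a code driven by symbol statistics, and the phoneme sequence is a hypothetical SAMPA-like sample.

```python
import heapq
import math
from collections import Counter


def fixed_length_code(alphabet):
    """Assign every phoneme a code of ceil(log2(N)) bits."""
    symbols = sorted(alphabet)
    width = max(1, math.ceil(math.log2(len(symbols))))
    return {s: format(i, f"0{width}b") for i, s in enumerate(symbols)}


def huffman_code(frequencies):
    """Build a prefix code whose code lengths follow the phoneme statistics."""
    # Heap entries: (weight, unique tie-breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""})
            for i, (s, w) in enumerate(sorted(frequencies.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(phonemes, code):
    """Concatenate the bit strings of all phonemes."""
    return "".join(code[p] for p in phonemes)


phonemes = "E m E f g e E m".split()  # hypothetical phoneme sequence
fixed = fixed_length_code(set(phonemes))
huff = huffman_code(Counter(phonemes))
print(len(encode(phonemes, fixed)), len(encode(phonemes, huff)))
```

In a real system the code table would have to be known to both the encoder and the decoder, for instance by deriving it from agreed phoneme statistics. That part is outside this sketch.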

The additional phonetic transcription information is
necessary for a large number of applications which include a
TTS (text-to-speech) functionality or a speech recognition system. In fact, the
speech interaction with any kind of multimedia system is
based on a single language, normally the native language of
the user. Therefore the HMI (the known vocabulary) is
adapted to this language. Nevertheless, the words which are
used by the user or which should be presented to the user
can also include terms from another language. Thus, the TTS
system or the speech recognition does not know the right
pronunciation for these terms. Using the proposed phonetic
description solves this problem and makes the HMI much more
reliable and natural.

A multimedia system providing content of any kind to the
user needs such phonetic information. Any additional text
information about the content can include technical terms,
names or other words which need special pronunciation
information in order to be presented to the user via TTS. The same
holds for news, emails or other information which should be
read to the user.
In particular, a film or music storage device, which can be a
CD, CD-ROM, DVD, MP3, MD or any other device, contains a lot
of films and songs with a title, actor name, artist name,
genre, etc. The TTS system does not know how to pronounce
all these words and the speech recognition cannot recognise
such words. If the user for example wants to listen to pop
music and the multimedia system should give a list of
available pop music via TTS, it would not be able to
pronounce the found CD titles, artist names or song names
without additional phonetic information.

If the multimedia system should present (via text-to-speech
(TTS) interfaces) a list of the available film or music
genres, it also needs this phonetic transcription
information. The same also holds for the speech recognition,
which can then better identify corresponding elements of the textual
description.

Another application is the radio (via FM, DAB, DVB, RDM,
etc.). If the user wants to listen to the radio and the
system should present a list of the available programs, it
would not be possible to pronounce the program names, because
some radio programs have names like "BBC" or "WDR". Others have
a name using normal words like "Antenne Bayern" and some
names are a mixture of both, e.g. "N-Joy".



The telephone application often provides a telephone book.
Even in this case, without phonetic transcription information
the system cannot recognise or present the names via TTS,
because it does not know how to pronounce them.
So any functionality or application which presents
information to the user via TTS or which uses
speech recognition needs a phonetic transcription for some words.

Optionally it is possible to transmit a reference to any
given alphabet which is used to represent the phonetic
element.

The translation hints together with the corresponding 
elements of the textual description can be implemented in 
text-to-speech interfaces, speech recognition devices, 
navigation systems, audio broadcast equipment, telephone 
applications, etc., which use textual description in
combination with phonetic transcription information for 
search or filtering of information. 






R. 40159 



09.01.01 Sk/Zj

ROBERT BOSCH GMBH, 70442 Stuttgart



Claims 

1. Method for upgrading a data stream of multimedia data, 
which comprises features with textual description, 
characterized in that in addition to the textual 
description a set of phonetic translation hints is 
included in the data stream, which specify the phonetic 
transcription of parts or words of the textual 
description. 

2. Method according to claim 1, characterized in that a 
phonetic translation hint is followed by a word and its 
corresponding phonetic transcription. 

3. Method according to one of claims 1 or 2, characterized 
in that a phonetic translation hint with the phonetic 
transcription of a word is valid for the whole textual 
description or parts of it without requiring that the 
phonetic transcription is repeated for each occurrence of 
the word for which the transcription is given in the 
textual description.

4. Method according to one of claims 1 to 3, characterized
in that the phonetic translation hints are embedded in an
MPEG data stream, e.g. an MPEG-7 data stream, associated with textual
type descriptors.

5. Method according to one of claims 1 to 4, characterized
in that for the representation of phonetic transcription
information a reference to an alphabet in a given code
format, e.g. the IPA (International Phonetic Alphabet) or
SAMPA, is made.

6. Method according to one of claims 1 to 5, characterized
in that the phonemes used in the phonetic translation 
hints are restricted to a limited number. 

7. Method according to claim 6, characterized in that a 
binary fixed length or variable length code 
representation is used for the phonemes. 

8. Method according to claim 7, characterized in that coding
of the phonemes takes into account the statistics of the
phonemes.

9. Method according to one of claims 1 to 6, characterized 
in that the translation hints are stored in a speech 
recognition system to better identify corresponding 
elements of the textual description. 

10. Method according to one of claims 1 to 8, characterized 
in that the translation hints together with the 
corresponding elements of the textual description are 
implemented in text-to-speech interfaces, speech 
recognition devices, navigation systems, audio broadcast 
equipment, telephone applications, etc., which use 
textual description in combination with phonetic 
information for search or filtering of information. 
Method for upgrading a data stream of multimedia data

Abstract

For upgrading a data stream of multimedia data, which
comprises features with textual description, a set of
phonetic translation hints is included in the data stream,
which specify the phonetic transcription of parts or words
of the textual description. The phonetic transcriptions need
not be repeated for each occurrence of a word. This
reduces the amount of data necessary for storing or
transmitting the description text.